Interactive Scientific Image Analysis using Spark

4Quant
Java User Group, 11 January 2016

Outline

  • Imaging in 2016
  • The Problem(s)

Solutions

  • Big Data Approaches
  • Spark Imaging Layer
    • 3D Imaging
    • Hyperspectral Imaging
    • Interactive Analysis / Streaming

The Challenges / Science

[Example images: mammography, satellite imagery, brain tumors, throat cancer]

Image Science in 2016: More and faster

X-Ray

  • Swiss Light Source (SRXTM) images at (>1000fps) \( \rightarrow \) 8GB/s, diffraction patterns (cSAXS) at 30GB/s
  • Nanoscopium (Soleil), 10TB/day, 10-500GB file sizes, very heterogeneous data

Optical

  • Light-sheet microscopy (see talk of Jeremy Freeman) produces images \( \rightarrow \) 500MB/s
  • High-speed confocal images at (>200fps) \( \rightarrow \) 78Mb/s

Geospatial

  • New satellite projects (Skybox, etc) will measure hundreds of terabytes to petabytes of images a year

Personal

  • GoPro 4 Black - 60MB/s (3840 x 2160 x 30fps) for $600
  • fps1000 - 400MB/s (640 x 480 x 840 fps) for $400

Synchrotron X-Ray Microscopy

Recent improvements allow you to:

  • peer deep into large samples
  • achieve \( \mathbf{<1\mu m} \) isotropic spatial resolution
    • with 1.8mm field of view
  • achieve >10 Hz temporal resolution
  • 8GB/s of images

SLS and TOMCAT

[1] Mokso et al., J. Phys. D, 46(49),2013

Courtesy of M. Pistone at U. Bristol

How much is a TB, really?

If you looked at one 1000 x 1000 sized image every second

[Figure: mammography image]

It would take you 139 hours to browse through a terabyte of data.
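The 139-hour figure is straightforward arithmetic if we assume 16-bit (2-byte) pixels, an assumption since the slide does not state the bit depth; a quick Java sketch:

```java
// Back-of-envelope: time to view 1 TB of 1000 x 1000 images, one per second.
// Assumes 16-bit (2-byte) pixels -- an assumption, not stated on the slide.
public class BrowseTime {
    public static double hoursPerTerabyte() {
        double bytesPerImage = 1000.0 * 1000.0 * 2.0; // 1000 x 1000 pixels, 2 bytes each
        double imagesPerTB = 1e12 / bytesPerImage;    // 500,000 images in a terabyte
        return imagesPerTB / 3600.0;                  // at one image per second
    }

    public static void main(String[] args) {
        System.out.printf("%.0f hours%n", hoursPerTerabyte()); // 139 hours
    }
}
```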

Year | Time to 1 TB | Manpower to keep up | Salary costs / month
2000 | 4096 min     | 2 people            | 25 kCHF
2008 | 1092 min     | 8 people            | 95 kCHF
2014 | 32 min       | 260 people          | 3255 kCHF
2016 | 2 min        | 3906 people         | 48828 kCHF

Challenge mouse brain: Full Vascular Imaging

  • Measure a full mouse brain (1cm\( ^3 \)) with cellular resolution (1 \( \mu \) m)
  • 10 x 10 x 10 scans at 2560 x 2560 x 2160 \( \rightarrow \) 14 TVoxels
  • 0.000004% of the entire dataset
    [Figure: a single slice]
  • 14TVoxels = 56TB
  • Each scan needs to be registered and aligned together
  • There are no computers with 56TB of memory
  • Analysis of the stitched data is also of interest (segmentation, vessel analysis, distribution and network connectivity)

The raw data

[Figure: the raw data]

What is wrong with usual approaches?

Normally when problems are approached they are solved for a single task as quickly as possible

  • I need to filter my image with a median filter with a neighborhood of 5 x 5 and a square kernel
  • then make a threshold of 10
  • label the components
  • then count how many voxels are in each component
  • save it to a file
im_in = imread('test.jpg');
im_filter = medfilt2(im_in, [5, 5]);
cl_img = bwlabel(im_filter > 10);
cl_count = hist(cl_img(:), 1:100);  % histogram over all voxels, not per column
dlmwrite('out.txt', cl_count)       % filename comes first
  • What if you want to compare Gaussian and Median?
  • What if you want to look at 3D instead of 2D images?
  • What if you want to run the same analysis for a folder of images?

You have to rewrite everything, every time.

If you start with a bad approach it is very difficult to fix later; big data and reproducibility must be considered from the beginning.
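One hypothetical way to avoid the rewrite-everything trap is to express each step as a composable function, so a filter can be swapped or a step added without touching the rest; the class and method names below are illustrative, not from the talk:

```java
import java.util.function.Function;

// Hypothetical sketch: each processing step is a Function, so swapping the
// filter (median vs. Gaussian) or adding steps does not require a rewrite.
public class Pipeline {
    // Toy "image": a 1-D array of pixel values.
    public static Function<int[], int[]> threshold(int t) {
        return img -> {
            int[] out = new int[img.length];
            for (int i = 0; i < img.length; i++) out[i] = img[i] > t ? 1 : 0;
            return out;
        };
    }

    public static Function<int[], Integer> countForeground() {
        return img -> {
            int n = 0;
            for (int v : img) n += v;
            return n;
        };
    }

    public static void main(String[] args) {
        // threshold at 10, then count -- the composition replaces the script
        Function<int[], Integer> analysis = threshold(10).andThen(countForeground());
        int[] image = {3, 12, 40, 7, 11};
        System.out.println(analysis.apply(image)); // 3
    }
}
```

Running the same analysis on a folder of images is then just applying `analysis` to each element of a collection.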

Image Processing in Java

final int[][][] outMap = new int[nSize.x * 2 + 1][nSize.y * 2 + 1][nSize.z * 2 + 1];
int dz = 0;
for (int z2 = max(z - nSize.z, lowz); z2 <= min(z + nSize.z, uppz - 1); z2++, dz++) {
    int dy = 0;
    for (int y2 = max(y - nSize.y, lowy); y2 <= min(y + nSize.y,
            uppy - 1); y2++, dy++) {
        int off2 = (z2 * dim.y + y2) * dim.x + max(x - nSize.x, lowx);
        int dx = 0;
        for (int x2 = max(x - nSize.x, lowx); x2 <= min(x + nSize.x,
                uppx - 1); x2++, off2++, dx++) {
            outMap[dx][dy][dz] = distmap[off2];
        }
    }
}
return outMap;

Problems

  • Verbose, fragile code
  • Not scalable: Java arrays are indexed by 32-bit int, so at most ~2.1 billion elements
    • Too small for 2048 x 2048 x 2048 (8.6 billion voxels)
  • Heap / memory problems

Computer Science Principles

Disclosure: there are entire courses / PhD theses / companies about this, so this is just a quick introduction

  • Parallelism
  • Distributed Computing
  • Resource Contention
    • Shared-Memory
    • Race Conditions
    • Synchronization
    • Deadlock
  • Imperative
  • Declarative

What is parallelism?

Parallelism is when you can divide a task into separate pieces which can then be worked on at the same time.

For example

  • if you have to walk 5 minutes and talk on the phone for 5 minutes
  • you can perform the tasks serially which then takes 10 minutes
  • you can perform the tasks in parallel which then takes 5 minutes

Some tasks are easy to parallelize while others are very difficult. Rather than focusing on programming, real-life examples are good indicators of difficulty.

What is distributed computing?

Distributed computing is very similar to parallel computing, but a bit more particular: parallel means you process many tasks at the same time, while distributed means the tasks are no longer running in the same CPU, process, or even on the same machine.

Distribution has important implications, since once you are no longer on the same machine, variables like network delay, file-system issues, and other users become major problems.

Distributed Computing Examples

  1. You have 10 friends who collectively know all the capital cities of the world.
    • To find the capital of a single country, you just yell the country and wait for someone to respond (+++)
    • To find who knows the most countries, each friend in turn yells out how many countries they know and you select the highest (++)
  2. Each friend has some money with them.
    • To find the total amount of money, you ask each person how much money they have and add it all together (+)
    • To find the median coin value, you ask each friend to list all the coins they have, build one master list, and then find the median coin (-)
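The coin examples map directly onto reduce-style versus collect-style operations; a hypothetical Java sketch (the class name and data layout are made up for illustration):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Sketch of why the sum (+) is easy and the median (-) is hard to distribute:
// sum is associative, so partial results from each "friend" can be combined;
// the median needs every coin gathered into one sorted master list.
public class FriendsMoney {
    public static int totalMoney(int[][] coinsPerFriend) {
        // each friend reduces locally, then the partial sums are combined
        return Arrays.stream(coinsPerFriend)
                     .mapToInt(coins -> IntStream.of(coins).sum())
                     .sum();
    }

    public static int medianCoin(int[][] coinsPerFriend) {
        // no local shortcut: flatten everything, sort, pick the middle
        int[] all = Arrays.stream(coinsPerFriend)
                          .flatMapToInt(IntStream::of)
                          .sorted()
                          .toArray();
        return all[all.length / 2];
    }

    public static void main(String[] args) {
        int[][] coins = {{5, 1}, {2, 2, 1}, {5}};
        System.out.println(totalMoney(coins)); // 16
        System.out.println(medianCoin(coins)); // 2
    }
}
```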

Resource Contention

The largest issue with parallel / distributed tasks is the need to access the same resources at the same time

  • memory / files
  • pieces of information
  • network resources

Dead-lock

Dining Philosophers Problem

  • 6 philosophers at the table
  • 6 forks
  • Everyone needs two forks to eat
  • Each philosopher takes the fork on his left → everyone holds one fork and waits forever for a second
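A minimal Java sketch of the deadlock and its classic fix (hypothetical, not from the talk): if every philosopher grabbed his left fork first, all six could hold one fork and wait forever; acquiring forks in a global order breaks the waiting cycle:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class Philosophers {
    public static int dine(int n) {
        final ReentrantLock[] forks = new ReentrantLock[n];
        for (int i = 0; i < n; i++) forks[i] = new ReentrantLock();
        final AtomicInteger meals = new AtomicInteger();
        Thread[] philosophers = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int left = i, right = (i + 1) % n;
            // the fix: always pick up the lower-numbered fork first, so a
            // circular chain of "holding one fork, waiting for the other"
            // (every philosopher grabbing his left fork) cannot form
            final int first = Math.min(left, right), second = Math.max(left, right);
            philosophers[i] = new Thread(() -> {
                forks[first].lock();
                try {
                    forks[second].lock();
                    try {
                        meals.incrementAndGet(); // "eat"
                    } finally { forks[second].unlock(); }
                } finally { forks[first].unlock(); }
            });
            philosophers[i].start();
        }
        for (Thread t : philosophers) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return meals.get();
    }

    public static void main(String[] args) {
        System.out.println(dine(6)); // 6 -- every philosopher got to eat
    }
}
```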

Parallel Challenges

Coordination

Parallel computing requires a significant amount of coordination between computers for tasks that are not easily parallelizable.

Mutability

The second major issue is mutability: if you have two cores / computers trying to write the same information at the same time, the result is no longer deterministic (not good)
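A small Java sketch of why shared mutation needs synchronization (the class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the mutability problem: `counter++` on a plain int is a
// read-modify-write, so two threads can read the same value and one update
// is silently lost. AtomicInteger makes the increment a single atomic step.
public class SharedCounter {
    public static int countAtomically(int threads, int perThread) {
        final AtomicInteger counter = new AtomicInteger();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet(); // atomic; a plain int++ could lose updates
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countAtomically(4, 100_000)); // always 400000
    }
}
```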

Blocking

The simple act of taking turns and waiting for every independent process to take its turn can completely negate the benefits of parallel computing

Parallelizing Threading

for (int i = 0; i < nCores; i++) {
    if (bfArray[i] != null) {
        try {
            System.out.println("Joining " + bfArray[i]);
            bfArray[i].join(); // wait until thread has finished
        } catch (final InterruptedException e) {
            System.out.println("ERROR - Thread : " + bfArray[i]
                    + " was interrupted, proceed carefully!");
        }
    }
}
public synchronized void addBubble(final SeedLabel nBubble) {
    bubList.add(nBubble); // java.util.List uses add(), not append()
    completedBubbles++;
}

Distributed Challenges

Inherits all of the problems of parallel programming with a whole variety of new issues.

Sending Instructions / Data Afar

Fault Tolerance

If you have 1000 computers working on solving a problem and one fails, you do not want your whole job to crash

Data Storage

How can you access and process data from many different computers quickly without very expensive infrastructure

A Cluster?

Basic Topology

  • Driver Node

The machine being used by the user which is responsible for creating jobs which need to run

  • Master Node

Distributes task across multiple machines, coordinates communication between machines

  • Worker Nodes

The nodes where the computation is actually done


Using a cluster

  • \( \rightarrow \) Tell the driver to load all images
  • The driver determines which data need to be loaded
  • Driver \( \rightarrow \) Master exact data to load


Steps

  • tell machine 1 to load images 0 to 2
  • tell machine 2 to load images 3 to 5


Exchanging Images

Share the images between machines to calculate the overlap:

  • \( \rightarrow \) tell machine 1 to send image 0 to machine 2
  • \( \rightarrow \) tell machine 2 to calculate: \( \textrm{Overlap}(A\rightarrow B) \)


Computing has changed: Parallel

Moore's Law

\[ \textrm{Transistors} \propto 2^{T/(\textrm{18 months})} \]

Based on trends from Wikipedia and Intel; data from https://gist.github.com/humberto-ortiz/de4b3a621602b78bf90d

There are now many more transistors inside a single computer but the processing speed hasn't increased. How can this be?

  • Multiple cores
    • Many machines have multiple cores per processor which can perform tasks independently
  • Multiple CPUs
    • More than one chip is commonly present
  • New modalities
    • GPUs provide many cores which operate at slower speeds

Parallel Code is important

Cloud Computing Costs

The figure shows the range of cloud costs (determined by peak usage) compared to a local workstation with utilization shown as the average number of hours the computer is used each week.


The figure shows the cost of a cloud-based solution as a percentage of the cost of buying a single machine; values below 1 mean the cloud is cheaper. The panels distinguish the average time to replacement for the machines, in months.


Imperative Programming

Directly coordinating tasks on a computer.

  • Languages like C, C++, Java, Matlab
  • Exact orders are given (implicit time ordering)
  • Data management is manually controlled
  • Job and task scheduling is manual
  • Potential to tweak and optimize performance

Making a soup

  1. Buy vegetables at market
  2. then Buy meat at butcher
  3. then Chop carrots into pieces
  4. then Chop potatoes into pieces
  5. then Heat water
  6. then Wait until boiling then add chopped vegetables
  7. then Wait 5 minutes and add meat

Declarative

  • Languages like SQL, Erlang, Haskell, Scala, Python, R can be declarative
  • Goals are stated rather than specific details
  • Data is automatically managed and copied
  • Scheduling is automatic but not always efficient

Making a soup

  • Buy vegetables at market \( \rightarrow shop_{veggies} \)
  • Buy meat at butcher \( \rightarrow shop_{meat} \)
  • Wait for \( shop_{veggies} \): Chop carrots into pieces \( \rightarrow chopped_{carrots} \)
  • Wait for \( shop_{veggies} \): Chop potatoes into pieces \( \rightarrow chopped_{potatoes} \)
  • Heat water \( \rightarrow boiling_{water} \)
  • Wait for \( boiling_{water} \), \( chopped_{carrots} \), \( chopped_{potatoes} \): Add chopped vegetables
    • Wait 5 minutes and add meat
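The dependency-graph style of the declarative soup can be sketched with Java's CompletableFuture (an illustrative analogy, not from the talk); each step declares only what it waits for, not when it runs:

```java
import java.util.concurrent.CompletableFuture;

public class DeclarativeSoup {
    public static String cook() {
        // independent steps start immediately, in any order
        CompletableFuture<String> veggies = CompletableFuture.supplyAsync(() -> "veggies");
        CompletableFuture<String> meat    = CompletableFuture.supplyAsync(() -> "meat");
        CompletableFuture<String> water   = CompletableFuture.supplyAsync(() -> "boiling water");

        // chopping waits only on the shopping it depends on
        CompletableFuture<String> carrots  = veggies.thenApply(v -> "chopped carrots");
        CompletableFuture<String> potatoes = veggies.thenApply(v -> "chopped potatoes");

        // adding vegetables waits on the water and both chopped ingredients
        CompletableFuture<String> pot = water
            .thenCombine(carrots, (w, c) -> w + " + " + c)
            .thenCombine(potatoes, (p, c) -> p + " + " + c);

        // meat goes in last
        return pot.thenCombine(meat, (p, m) -> p + " + " + m).join();
    }

    public static void main(String[] args) {
        System.out.println(cook());
        // boiling water + chopped carrots + chopped potatoes + meat
    }
}
```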

Comparison

They look fairly similar, so what is the difference? The second one seems needlessly complicated for one person, but what if you have a team? How can several people make an imperative soup faster (by chopping vegetables together)?

Imperative soup

  1. Buy {carrots, peas, tomatoes} at market
  2. then Buy meat at butcher
  3. then Chop carrots into pieces
  4. then Chop potatoes into pieces
  5. then Heat water
  6. then Wait until boiling then add chopped vegetables
  7. then Wait 5 minutes and add meat

How can many people make a declarative soup faster? Give everyone a different task (not completely efficient since some tasks have to wait on others)

Declarative soup

  • Buy {carrots, peas, tomatoes} at market \( \rightarrow shop_{veggies} \)
  • Buy meat at butcher \( \rightarrow shop_{meat} \)
  • Wait for \( shop_{veggies} \): Chop carrots into pieces \( \rightarrow chopped_{carrots} \)
  • Wait for \( shop_{veggies} \): Chop potatoes into pieces \( \rightarrow chopped_{potatoes} \)
  • Heat water \( \rightarrow boiling_{water} \)
  • Wait for \( boiling_{water} \), \( chopped_{carrots} \), \( chopped_{potatoes} \): Add chopped vegetables
    • Wait 5 minutes and add meat

Results

Imperative

  • optimize specific tasks (chopping vegetables, mixing) so that many people can do it faster
    • Matlab/Python do this with fast-fourier-transforms (automatically uses many cores to compute faster)
  • make many soups at the same time (independent)
    • This leads us to cluster-based computing

Declarative

  • run everything at once
  • each core (computer) takes a task and runs it
  • execution order does not matter
    • wait for portions to be available (dependency)

Lazy Evaluation

  • do not run anything at all
  • until something needs to be exported or saved
  • run only the tasks that are needed for the final result
    • never buy tomatoes since they are not in the final soup
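Java streams show the same lazy behavior (an illustrative sketch): the intermediate map does nothing until a terminal operation asks for a result, and only the elements actually needed are touched:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class LazyDemo {
    public static List<Integer> touched() {
        List<Integer> seen = new ArrayList<>();
        // intermediate operation: records which elements get squared; nothing runs yet
        Stream<Integer> squared = Stream.of(1, 2, 3, 4, 5)
            .map(x -> { seen.add(x); return x * x; });
        // terminal operation: pulls elements one at a time and stops early
        squared.filter(x -> x > 3).findFirst();
        return seen; // only 1 and 2 were ever squared
    }

    public static void main(String[] args) {
        System.out.println(touched()); // [1, 2]
    }
}
```

The tomatoes are never bought for the same reason: no result depends on them.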

The Problem

There is a flood of new data

What took an entire PhD 3-4 years ago, can now be measured in a weekend, or even several seconds. Analysis tools have not kept up, are difficult to customize, and usually highly specific.

Optimized Data-Structures do not fit

Data structures that were fast and efficient for computers with 640 KB of memory do not make sense anymore

Single-core computing is too slow

CPUs are not getting that much faster, but there are a lot more of them. Iterating through a huge array takes almost as long on 2014 hardware as on 2006 hardware

Exploratory Image Analysis Priorities

Correctness

The most important job for any piece of analysis is to be correct.

  • A powerful testing framework is essential
  • Avoid repetition of code which leads to inconsistencies
  • Use compilers to find mistakes rather than users

Easily understood, changed, and used

Almost all image processing tasks require a number of people to evaluate and implement them, and the requirements are almost always moving targets

  • Flexible, modular structure that enables replacing specific pieces

Fast

The last of the major priorities is speed, which covers scalability, raw performance, and development time.

  • Long waits for processing discourage exploration
  • Manual access to data on separate disks is a huge speed barrier
  • Real-time image processing requires millisecond latencies
  • Implementing new ideas can be done quickly

The Framework First

  • Rather than building an analysis as quickly as possible and then trying to hack it to scale up to large datasets:
    • choose the framework first
    • then start making the necessary tools.
  • Google, Amazon, Yahoo, and many other companies have made huge in-roads into these problems
  • The real need is a fast, flexible framework for robustly and scalably performing complicated analyses: a sort of Excel for big imaging data.

A brief oversimplified story

Google ran into 'big data' and its associated problems years ago: peta- and exabytes of websites to collect and make sense of. Google uses an algorithm called PageRank™ for evaluating the quality of websites. They could probably have used existing tools if PageRank were some magic program that could read and determine the quality of a site:

for every_site_on_internet
  current_site.rank=secret_pagerank_function(current_site)
end

Just divide all the websites into a bunch of groups and have each computer run a group, easy!

PageRank

While the actual internals of PageRank are not public, the general idea is that sites are ranked based on how many sites link to them

for current_site in every_site_on_internet
  current_pagerank = new SecretPageRankObj(current_site);
  for other_site in every_site_on_internet
    if current_site is_linked_to other_site
      current_pagerank.add_site(other_site);
    end
  end
  current_site.rank=current_pagerank.rank();
end

How do you divide this task?

  • Maybe try to divide the sites up: english_sites, chinese_sites, …
    • Run PageRank on each group separately.
    • What happens when a chinese_site links to an english_site?
  • Buy a really big, really fast computer?
    • On the most powerful computer in the world, one loop would take months

It gets better

  • What happens if one computer / hard drive crashes?
    • Have a backup computer replace it (a backup computer for every single system)
    • With a few computers that is fine, but with hundreds of thousands of computers?
    • What if there is an earthquake and all the computers go down?
  • PageRank doesn't just count links
    • It uses the old rankings for that page
    • PageRank is run many times until the ranks converge

Google's Solution: MapReduce (part of it)

Some people claim to have had the idea before, but Google was certainly the first to do it at scale.

Several engineers at Google recognized common elements in many of the tasks being performed. They proceeded to divide all tasks into two classes: Map and Reduce.

Map

Map is where a function is applied to every element in the list and the function depends only on exactly that element \[ \vec{L} = \begin{bmatrix} 1,2,3,4,5 \end{bmatrix} \] \[ f(x) = x^2 \] \[ map(f \rightarrow \vec{L}) = \begin{bmatrix} 1,4,9,16,25 \end{bmatrix} \]

Reduce

Reduce is more complicated and involves aggregating a number of different elements and summarizing them. For example the \( \Sigma \) function can be written as a reduce function \[ \vec{L} = \begin{bmatrix} 1,2,3,4,5 \end{bmatrix} \] \[ g(a,b) = a+b \] Reduce then applies the function to the first two elements, then to the result of the first two with the third, and so on until all the elements are done \[ reduce(g \rightarrow \vec{L}) = g(g(g(g(1,2),3),4),5) = 15 \]
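The two definitions can be written directly with Java streams (an illustrative sketch): \( \vec{L} = [1,2,3,4,5] \), \( f(x) = x^2 \), \( g(a,b) = a+b \).

```java
import java.util.Arrays;

public class MapReduceDemo {
    public static int[] map(int[] l) {
        // f is applied independently to each element -- trivially parallel
        return Arrays.stream(l).map(x -> x * x).toArray();
    }

    public static int reduce(int[] l) {
        // g(g(g(g(1,2),3),4),5); safe to parallelize because + is associative
        return Arrays.stream(l).reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        int[] L = {1, 2, 3, 4, 5};
        System.out.println(Arrays.toString(map(L))); // [1, 4, 9, 16, 25]
        System.out.println(reduce(L));               // 15
    }
}
```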

MapReduce

They designed a framework for handling distributing and running these types of jobs on clusters. So for each job a dataset (\( \vec{L} \)), Map-task (\( f \)), a grouping, and Reduce-task (\( g \)) are specified

  • Partition input data (\( \vec{L} \)) into chunks across all machines in the cluster \[ \downarrow \]
  • Apply Map (\( f \)) to each element \[ \downarrow \]
  • Shuffle and Repartition or Group Data \[ \downarrow \]
  • Apply Reduce (\( g \)) to each group \[ \downarrow \]
  • Collect all of the results and write to disk

All of the steps in between can be written once in a robust, safe manner and then reused for every task which can be described using this MapReduce paradigm. Each such task \( \langle \vec{L}, f(x), g(a,b) \rangle \) is referred to as a job.

Key-Value Pairs / Grouping

The initial model was very basic; for more complicated jobs, a new notion of key-value (KV) pairs must be introduced. A KV pair is made up of a key and a value. A key must be comparable / hashable (a number, string, immutable list of numbers, etc.) and is used for grouping data. The value is the information associated with this key.

Counting Words

Using MapReduce on a folder full of text-documents.

  • \[ \vec{L} = \begin{bmatrix} "\textrm{Info}\cdots", "\textrm{Expenses}\cdots",\cdots \end{bmatrix} \]
  • Map is then a function \( f \) which takes in a long string and returns a list of all of the words (text separated by spaces) as key-value pairs, with the value being the number of times that word appears
  • f(x) = [(word,1) for word in x.split(" ")]
  • Grouping is then performed by keys (group all words together)
  • Reduce adds up the values for each word
L = ["cat dog car",
  "dog car dog"]

\[ \downarrow \textbf{ Map } : f(x) \]

[("cat",1),("dog",1),("car",1),("dog",1),("car",1),("dog",1)]

\[ \downarrow \textrm{ Shuffle / Group} \]

"cat": (1)
"dog": (1,1,1)
"car": (1,1)

\[ \downarrow \textbf{ Reduce } : g(a,b) \]

[("cat",1),("dog",3),("car",2)]
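The whole word-count job above can be reproduced in plain Java (an illustrative, single-machine sketch; Spark/Hadoop distribute the same three steps across a cluster):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))  // Map: emit one word per token
            .collect(Collectors.groupingBy(                   // Shuffle / Group by key
                word -> word,
                Collectors.counting()));                      // Reduce: sum counts per key
    }

    public static void main(String[] args) {
        List<String> L = Arrays.asList("cat dog car", "dog car dog");
        Map<String, Long> counts = count(L);
        System.out.println(counts.get("dog")); // 3
    }
}
```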

Hadoop

Hadoop is the open-source implementation of MapReduce, developed largely at Yahoo and released as an Apache project. It provides the underlying infrastructure and filesystem that handle storing and distributing data, so each machine stores some of the data locally and processing jobs run where the data is stored.

  • Non-local data is copied over the network.
  • Storage is automatically expanded with processing power.
  • It's how Amazon, Microsoft, Yahoo, Facebook, … deal with exabytes of data

Apache Spark

  • “The Ultimate Scala Collections”- Martin Odersky (EPFL / Creator of Scala)

  • “MapReduce on Steroids”

  • Developed by the Algorithms, Machines, and People Lab at UC Berkeley in 2012

  • General tool for all Directed Acyclic Graph (DAG) workflows

  • Coarse-grained processing \( \rightarrow \) simple operations applied to entire sets

    • Map, reduce, join, group by, fold, foreach, filter,…
  • In-memory caching

Zaharia, M., et. al (2012). Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Ultimate Collections?

  • Standard Processing
    • In Java
int[] newImage = new int[inputImage.length];
for(int i=0;i<inputImage.length; i++) {
  if(inputImage[i]>0) {
    newImage[i] = inputImage[i];
  } 
}
  • In Scala
val newImage = inputImage.map(v => if(v>0) v else 0)
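For comparison, the same clamp written with Java 8 streams (a sketch, not from the talk) is nearly as compact as the Scala one-liner:

```java
import java.util.Arrays;

public class Clamp {
    public static int[] clamp(int[] inputImage) {
        // keep positive pixels, set everything else to 0
        return Arrays.stream(inputImage).map(v -> v > 0 ? v : 0).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(clamp(new int[]{-2, 3, 0, -7, 5})));
        // [0, 3, 0, 0, 5]
    }
}
```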

With Spark

Looks the same as Scala, what is the difference?

val newImage = inputImage.map(v => if(v>0) v else 0)

It is being performed on a cluster where the computation is not only divided between all available nodes, but also locality-aware meaning it moves the computation where the data is.

With Spark (Java)

A bit more verbose, but the idea remains the same.

JavaRDD<Integer> newImage = inputImage.map(
  new Function<Integer, Integer>() {
    public Integer call(Integer x) {
      if (x > 0) return x;
      return 0;
    }
  });

Any possible problems?

Image Query and Analysis Engine

These frameworks are really cool and Apache Spark has a lot of functionality, but I don't want to map-reduce a collection.

I want to:

  • filter out noise, segment, choose regions of interest
  • contour, component label
  • measure, count, and analyze

Image Query and Analysis Engine (IQAE)

  • Developed at 4Quant, ETH Zurich, and Paul Scherrer Institut
  • The IQAE is a Domain Specific Language for Microscopy for Spark.
  • It converts common imaging tasks into coarse-grained Spark operations


IQAE

We have developed a number of commands for IQAE handling standard image processing tasks

SIL Commands

Fully extensible with Spark languages


How do we get started?

  • First we start our cluster:
    spark-ec2.py -s 50 launch 4quant-image-cluster

  • Load all of the samples (56TB of data)

loadImage(
  "s3:/../brain_*_*_*/rec.tif"
  )

\[ \downarrow \] A Resilient Distributed Dataset (RDD) \[ \textrm{Images}: \textrm{RDD}[((x,y,z),Img[Double])] =\\ \left[(\vec{x},\textrm{Img}),\cdots\right] \]

  • Now start processing: run a Gaussian filter on all the images to reduce the noise
filteredImages = Images.gaussianFilter(1,0.5)
  • Calculate the volume fraction at a threshold of 50%
volFraction = Images.threshold(0.5).
  map{keyImg =>
    (sum(keyImg.img),keyImg.size) 
    }.reduce(_+_)
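What each worker computes locally for the volume fraction can be sketched in plain Java (a hypothetical analogue of the DSL call above; names are illustrative):

```java
public class VolumeFraction {
    public static double volumeFraction(double[] voxels, double threshold) {
        long above = 0;
        for (double v : voxels) if (v > threshold) above++; // per-block count
        // in the cluster version, the (count, size) pairs from every block
        // are summed in the reduce step before taking this ratio
        return (double) above / voxels.length;
    }

    public static void main(String[] args) {
        double[] img = {0.1, 0.7, 0.9, 0.2, 0.6};
        System.out.println(volumeFraction(img, 0.5)); // 0.6
    }
}
```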

Stitching?

We have all of the filtered images and we want to stitch them together in a smart way.

  • Start simple and match the images up with their neighbors (generate all image pairs, then keep only the proximal ones)
pairImages = Images.
  cartesian(Images).
  filter((im1,im2) => dist(im1.pos,im2.pos)<1)


Cross Correlating

The cross correlation can then be executed on each pair of images from the new RDD (pairImages) by using the map command

displacementField = pairImages.
  map{
  ((posA,ImgA), (posB,ImgB)) =>

    xcorr(ImgA,ImgB,
      in=posB-posA)

  }


From Matching to Stitching

From the updated information provided by the cross correlations, a smoothed displacement field can be computed by applying appropriate smoothing criteria across windows.

smoothField = displacementField.
    window(3,3,3).
    map(gaussianSmoothFunction)

This also ensures the original data is left unaltered and all analysis is reversible.

Viewing Regions

The final stitching can then be done by:

alignImages.
  filter(x=>abs(x-tPos)<img.size).
  map { (x,img) =>
   new Image(tSize).
    copy(img,x,tPos)
  }.combineImages()


Exploring the data

IQAE: Pathological Slices

  • Count the number of highly anisotropic nuclei in myeloma patients

\[ \downarrow \textrm{Translate to SQL} \]

SELECT COUNT(*) FROM 
  (SELECT SHAPE_ANALYSIS(LABEL_NUCLEI(pathology_slide)) FROM patients 
  WHERE disease LIKE "myeloma")
  WHERE anisotropy > 0.75

IQAE: Pathological Slices

\[ \downarrow \textrm{Load Myeloma Data Subset} \] Slices

IQAE: Pathological Slices

\[ \downarrow \textrm{Perform analysis on a every image} \] Slices

\[ \downarrow \textrm{Filter out the most anisotropic cells} \] Slices

IQAE: Maps

PSI

Find all strongly reflective large objects within 1 km of the Paul Scherrer Institut, Villigen, CH

\[ \downarrow \textrm{Translate to SQL} \]

SELECT contour FROM (
  SELECT COMPONENT_LABEL(THRESHOLD(tile,200)) FROM esriTiles 
  WHERE DIST(LAT,-47.53000992328762,LONG,8.215198516845703)<1
  ) WHERE area>200

IQAE: Maps

We can then visualize these contours easily

[Figure: extracted contours]

or apply them back to the original map

PSI

Spark / Resilient Distributed Datasets

Practical Specification

  • Distributed, parallel computing without logistics, libraries, or compiling
  • Declarative rather than imperative
    • Apply operation \( f \) to each image / block
    • NOT telling computer 3 to wait for an image from computer 2, perform operation \( f \), and send the result to computer 1
    • Even scheduling is handled automatically
  • Results can be stored in memory, on disk, redundant or not

We have a cool tool, but what does this mean for me?

ETH Spinoff - 4Quant: From images to insight

  • Quantitative Search Machine for Images
    • Make analyzing huge sets of image data like a database search
  • Custom Analysis Solutions
    • Custom-tailored software to solve your problems
    • Streaming / Acquisition Solutions
    • Integration with Image Database Systems
    • Cloud Solutions
  • Use our API for your projects

Education / Training

We support Open, Reproducible Science

While our framework is commercial, we build on top of and integrate into open-source tools so that the research is reproducible by anyone, anywhere, using any machine (it just might take a bit longer)

We also try to publish as much code, educational material, and samples as possible on our github account.

Github

Acknowledgements: 4Quant

Acknowledgements: ETH and PSI

We are interested in partnerships and collaborations

Learn more at

Hadoop Filesystem (HDFS not HDF5)

The bottleneck is the filesystem connection: many nodes (10+) reading in parallel bring even a GPFS-based InfiniBand system to a crawl

SIL

One of the central tenets of MapReduce™ is data-centric computation \( \rightarrow \) instead of moving data to the computation, move the computation to the data.

  • Use fast local storage for storing everything redundantly \( \rightarrow \) less transfer and fault-tolerance
  • Largest file size: 512 yottabytes; Yahoo has a 14-petabyte filesystem in use
